ABSTRACT

In most reliability studies, the precision of a
reliability estimate varies inversely with the number of examinees
(sample size). Thus, to achieve a given level of accuracy, some
minimum sample size is required. An approximation for this minimum
size may be made if some reasonable assumptions regarding the mean
and standard deviation of the test score distribution can be made. To
facilitate the computations, tables are developed based on the
Comprehensive Tests of Basic Skills. The tables may be used for tests
ranging in length from 5 to 30 items, with percent cutoff scores of
60%, 70%, or 80%, and with examinee populations for which the test
difficulty can be described as low, moderate, or high, and the test
variability as low or moderate. The tables also reveal that, for a
given degree of accuracy, an estimate of kappa would require a
considerably greater number of examinees than would an estimate of
the raw agreement index. (Author)

This work was performed pursuant to Grant No. NIE-G-78-0087 with
the National Institute of Education, Department of Health,
Education, and Welfare, Huynh Huynh, Principal Investigator.
1. INTRODUCTION

In many applications of educational and psychological testing,
an empirical demonstration of the reliability of the measuring
instrument is desirable. Such a demonstration is most meaningful
when the estimate of the reliability has been obtained with a
reasonable degree of accuracy. That is, the standard error of
estimate must be within some acceptable limit. In most instances,
the standard error is a decreasing function of the number of
examinees (sample size) to be included in the reliability study.
Thus, some minimum sample size is needed to achieve a given level
of precision. The purpose of this paper is to illustrate how this
sample size can be assessed in estimating the reliability of
mastery tests.

The paper consists of three major parts. The first part presents
an overview of the procedures for estimating two reliability
indices for mastery tests by using data collected from one test
administration. The use of the estimation process to determine the
minimum sample size is illustrated in the second part. Finally, a
set of tables is developed to facilitate the determination of the
minimum sample size in reliability studies for mastery tests.

2. OVERVIEW OF SINGLE-ADMINISTRATION 
ESTIMATES FOR RELIABILITY 

Mastery tests are commonly used to classify examinees into two
achievement categories, usually referred to as mastery and
nonmastery. The reliability of such tests is often viewed as the
consistency of mastery-nonmastery decisions. It may be quantified
via the raw agreement index (p) or the kappa index (κ). The p index
is simply the combined proportion of examinees classified
consistently as masters or nonmasters by two repeated testings
using the same form or two equivalent forms of a mastery test. The
kappa index, on the other hand, takes into account the level of
decision consistency which would result from random category
assignment. It expresses the extent to which the test scores
improve the consistency of decisions beyond the chance level.

Though both p and κ are defined in terms of repeated testings,
there are many practical situations in which they may be estimated
from the scores collected from a single test administration (Huynh,
1976). The estimation process assumes that the test scores conform
to a beta-binomial (negative hypergeometric) model, and may be
carried out via formulae, tables, and a computer program reported
elsewhere (Huynh, 1978; 1979). The data reported by Subkoviak
(1978) and by Huynh and Saunders (1979) tend to indicate that the
beta-binomial model yields reasonably accurate estimates for p and
κ in situations involving educational tests such as the Scholastic
Aptitude Test and the Comprehensive Tests of Basic Skills.
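
The single-administration computation can be sketched in a few lines of Python. This is a minimal illustration and not the program reported in Huynh (1978). It assumes that the beta-binomial parameters are obtained from the test mean and standard deviation by the method of moments (through the KR-21 coefficient), and that two parallel testings are independent given the examinee's true score. The function names and the inputs in the last lines (a six-item test with mean 5.0, standard deviation 1.2, and cutoff score 5) are chosen for this sketch.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def fit_beta_binomial(n, mean, sd):
    """Method-of-moments fit via KR-21 (an assumption of this sketch)."""
    kr21 = (n / (n - 1)) * (1 - mean * (n - mean) / (n * sd ** 2))
    if not 0 < kr21 < 1:
        raise ValueError("moments are not compatible with a beta-binomial fit")
    alpha = (1 / kr21 - 1) * mean
    beta = (1 / kr21 - 1) * (n - mean)
    return alpha, beta


def decision_consistency(n, mean, sd, cutoff):
    """Return (p, kappa) for two parallel n-item tests sharing one cutoff."""
    alpha, beta = fit_beta_binomial(n, mean, sd)
    base = log_beta(alpha, beta)
    passing = range(cutoff, n + 1)
    # Marginal probability of a mastery decision on a single testing.
    p_master = sum(
        comb(n, x) * exp(log_beta(alpha + x, beta + n - x) - base)
        for x in passing
    )
    # Probability of a mastery decision on both testings.
    p_both = sum(
        comb(n, x) * comb(n, y)
        * exp(log_beta(alpha + x + y, beta + 2 * n - x - y) - base)
        for x in passing
        for y in passing
    )
    # Consistent decisions: master on both testings or nonmaster on both.
    p = 1 - 2 * p_master + 2 * p_both
    p_chance = p_master ** 2 + (1 - p_master) ** 2
    kappa = (p - p_chance) / (1 - p_chance)
    return p, kappa


p, kappa = decision_consistency(n=6, mean=5.0, sd=1.2, cutoff=5)
print(round(p, 4), round(kappa, 4))
```

For these inputs the sketch gives values close to p = .753 and κ = .378. The standard-error constant G is not computed here; it still has to come from the tables or the program cited above.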

The beta-binomial model also provides asymptotic (large sample)
standard errors for the estimates. Simulation studies indicate that
the asymptotic standard errors tend to underestimate the actual
standard errors when the sample size is small (Huynh, 1980). The
degree of underestimation is not substantial when the sample has
sixty or more examinees. Since the beta-binomial model will be
used throughout the remaining part of this paper, a minimum sample
size of sixty examinees will be assumed to hold uniformly for all
cases under consideration.

3. ILLUSTRATIONS FOR SAMPLE SIZE
DETERMINATION

The standard errors (s.e.) of the estimates for p and for κ are
functions of the sample size m. The quantity G = s.e. × √m is,
however, asymptotically (i.e., in large samples) a constant. This
constant depends only on the number of items (n), the mean (μ)
and standard deviation (σ) of the test scores, and the cutoff score
(c). Given these parameters, the value of G may be determined via
the tables or the computer program presented elsewhere (Huynh,
1978). Once G is determined, a minimum sample size m can be
calculated which will restrict the standard error of estimate to
whatever tolerable range is required.

Suppose, for example, that an estimate of κ is needed for a
short (n = 6 items) test to be used with a particular population of
students. Passing or mastery on the test is to be granted if an
examinee attains a score of 5 or 6. Further, suppose that we want
the standard error of this estimate to be no larger than 10% of κ,
that is, s.e.(κ) ≤ .10κ.

What sample size would be needed to obtain the specified
degree of accuracy in the estimate? To answer this question using
the above-mentioned Huynh procedure, a preliminary knowledge of
the test mean and standard deviation is needed. Suppose past data
suggest that the students are generally well prepared on the
content of the test in question and can be expected to be fairly
homogeneous in achievement. We might suppose that in the population
the mean will be 5.0 and the standard deviation will be 1.2. Using
these values, and the cutoff score of 5, a value of G can be read
from the tables (or computed): G(κ) = .7390. If the population
mean and standard deviation are as given, then, assuming the
beta-binomial model, the population value of κ is .3778. These
results are then used to estimate the sample size needed to bring
the standard error of estimate within the desired limit (i.e., no
more than .10κ).

Since the standard error of estimate is approximately G/√m,
the sample size must be such that

$$\frac{G(\kappa)}{\sqrt{m}} \le .10\,\kappa$$

or, equivalently,

$$m \ge \left[\frac{G(\kappa)}{.10\,\kappa}\right]^2.$$

For this example, then,

$$m \ge \left[\frac{.7390}{(.10)(.3778)}\right]^2 = 382.62.$$

Thus, to have no more than 10% relative error requires that at
least 383 examinees be tested to estimate κ.

A similar computation can be made for s.e.(p) ≤ .10p when the
above assumed population values hold. Thus, using the tables,
G(p) = .3210, p = .7532, and

$$m \ge \left[\frac{G(p)}{.10\,p}\right]^2 = 18.16.$$

Because of the previously mentioned problem of underestimation in
small samples, a sample size of at least sixty is recommended
regardless of the above computation.
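
The two computations above can be reproduced with a short Python sketch. The function name and its parameters are illustrative choices made here, not part of the Huynh procedure. The values of G, p, and κ are the ones quoted in the text, and the floor of sixty examinees is the one recommended for the asymptotic standard errors.

```python
from math import ceil

MIN_EXAMINEES = 60  # floor recommended in the text for small samples


def minimum_sample_size(g, index_value, relative_error, floor=MIN_EXAMINEES):
    """Smallest m with G / sqrt(m) <= relative_error * index_value."""
    required = (g / (relative_error * index_value)) ** 2
    return max(floor, ceil(required))


# Six-item test, cutoff 5, projected mean 5.0 and standard deviation 1.2.
m_kappa = minimum_sample_size(g=0.7390, index_value=0.3778, relative_error=0.10)
m_p = minimum_sample_size(g=0.3210, index_value=0.7532, relative_error=0.10)
print(m_kappa, m_p)
```

With these inputs the sketch returns 383 examinees for κ and 60 for p, the latter because the computed value of about 18 falls below the floor.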

It might be disheartening to note that a much larger sample
size is needed to keep the standard error of the κ estimate within
the desired limits than is required when an estimate of p is used.
However, the standard error for κ is much larger than that of p
(Huynh, 1978). Thus, for the same relative size of errors of
estimation, larger samples are needed to estimate κ than to
estimate p. It could be argued that the same degree of accuracy of
estimation is not required. If so, then a less accurate estimate
of κ would allow a smaller sample size.

The above illustration presumes that the mean and standard
deviation of the test scores can be projected prior to the real test
administration. In a number of instances involving the use of
standardized tests for a heterogeneous group of students, reasonable
assumptions may be made which will yield projected values for both
μ and σ. For example, when an n-item multiple-choice test is built
to maximize the discrimination among individual examinees, it is
not unreasonable to assume that the test mean is halfway between
the expected chance score and the maximum score n, and that the
standard deviation is about one-sixth of the test score range from
0 to n. (If there are A options per item, the expected chance score
is n/A.) In other words, it is not unreasonable to presume that

$$\mu = \frac{n + n/A}{2}$$

and

$$\sigma = \frac{n}{6}.$$

For example, consider a test consisting of 10 four-option items.
Then A = 4, and the projected mean and standard deviation are
μ = 6.25 and σ = 1.66667. Presuming a cutoff score of c = 6, it
may be found that p = .6140, G(p) = .3661, κ = .1118, and
G(κ) = .8213. If a relative error of 5% is acceptable for p, then a
sample of at least $[.3661/((.05)(.6140))]^2 \approx 143$ students
(rounding up) would be needed. On the other hand, a relative error
of 25% for kappa would require $[.8213/((.25)(.1118))]^2 \approx 864$
students.
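
The projection rule and the two sample sizes in this example can be checked with the following Python sketch. The function names are illustrative. The values of G, p, and κ are taken from the text rather than computed, since they come from the tables or program cited above.

```python
from math import ceil


def projected_moments(n_items, n_options):
    """Projected mean and standard deviation under the stated assumptions."""
    chance_score = n_items / n_options       # expected score from guessing
    mean = (n_items + chance_score) / 2      # halfway between chance and n
    sd = n_items / 6                         # one-sixth of the range 0..n
    return mean, sd


def required_examinees(g, index_value, relative_error):
    """Round [G / (relative_error * index)]^2 up to a whole examinee."""
    return ceil((g / (relative_error * index_value)) ** 2)


mean, sd = projected_moments(n_items=10, n_options=4)
m_p = required_examinees(g=0.3661, index_value=0.6140, relative_error=0.05)
m_kappa = required_examinees(g=0.8213, index_value=0.1118, relative_error=0.25)
print(mean, round(sd, 5), m_p, m_kappa)
```

The sketch reproduces the projected mean of 6.25 and standard deviation of 1.66667, and the sample sizes of 143 for p and 864 for kappa.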

4. PRACTICAL CONSIDERATIONS IN SETTING SAMPLE 
SIZE IN BASIC SKILLS TESTING 

Some general formulae are given for expressing the
relationships among s.e., G, m, p, κ, and the proportion of sampling
error desired in an estimate. These general expressions will then be
used in a series of simulations designed to explore their typical
numerical values for real tests. Tables are developed to help the
practitioner decide on the sample size needed to obtain estimates
of p and κ for various degrees of precision.

General expressions 

Since G = s.e. × √m is a constant for large samples, this
expression forms the basis for the formulations in this section. In
the previous section, .10 and .05 were used as examples of desired
degrees of precision for a sample estimate of p. In general, we
will call this quantity γ, using γ_p and γ_κ to distinguish the
precisions desired for p and κ, respectively. Thus, the general
expressions for minimum sample size are:

$$m \ge \left[\frac{G(p)}{\gamma_p\,p}\right]^2$$

and

$$m \ge \left[\frac{G(\kappa)}{\gamma_\kappa\,\kappa}\right]^2.$$

A further simplification is to let R(p) = [G(p)/p]² and
R(κ) = [G(κ)/κ]². The above expressions for minimum sample size,
m, become

$$m \ge \frac{R(p)}{\gamma_p^2}$$

and

$$m \ge \frac{R(\kappa)}{\gamma_\kappa^2}.$$

These expressions allow the minimum sample size to be determined
from knowledge of two quantities, R and γ.
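
The step from the standard-error requirement to R can be written out as a short derivation. In this sketch θ is a placeholder introduced here (it does not appear in the source) standing for either p or κ, and γ_θ is the acceptable proportion of sampling error. The numerical check reuses the six-item κ example of Section 3.

```latex
\begin{align*}
\text{s.e.}(\hat{\theta}) \approx \frac{G(\theta)}{\sqrt{m}} \le \gamma_\theta\,\theta
  &\iff \sqrt{m} \ge \frac{G(\theta)}{\gamma_\theta\,\theta} \\
  &\iff m \ge \frac{1}{\gamma_\theta^{2}}\left[\frac{G(\theta)}{\theta}\right]^{2}
        = \frac{R(\theta)}{\gamma_\theta^{2}}, \\[4pt]
R(\kappa) = \left[\frac{.7390}{.3778}\right]^{2} \approx 3.826
  &\implies m \ge \frac{3.826}{(.10)^{2}} \approx 382.6 .
\end{align*}
```

The result agrees with the value 382.62 obtained directly in Section 3, so R carries all of the test-specific information and γ alone expresses the desired precision.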

Determining typical values of R(p) and R(κ)

In practical applications, the values of R(p) and R(κ) depend on
a test score distribution which is not yet available. So, as in the
previous section, conjectures must be made regarding the mean and
standard deviation of the test scores in order to project the
minimum sample size.

In this section, typical values for R(p) and R(κ) will be
reported for practical testing situations involving the assessment
of basic skills. Several combinations of test length, difficulty,
variability, and cutoff score will be used. To arrive at the
values of R(p) and R(κ) reported in Tables 1-3, the following series
of steps was taken.

First, a series of subtests was developed, using items found
in the Comprehensive Tests of Basic Skills (CTBS), Form S, Level 1.
The items composing each subtest were randomly selected from one of
five CTBS content areas, to reflect a variety of subjects and
skills. For each content area, subtests were constructed with 5,
10, 15, 20, 25, and 30 items, producing a total of 30 subtests.

Second, the administration of the subtests was simulated
using actual student responses. Data for the simulation came from
5,543 students, comprising a systematic sample (every tenth case)
of the third grade students tested using Level 1 of the CTBS by
the 1978 South Carolina Statewide Testing Program. From the
students' responses to each item in the CTBS, raw scores were
generated for each student on all 30 subtests.

Third, values of the mean and standard deviation of raw scores
on each test were obtained. District means and standard deviations
were calculated for each school district with 40 or more students
in the sample. For each of the 30 subtests, means and standard
deviations were plotted in a bivariate scatter diagram. The
scatter plots were divided into areas representing different
categories of test difficulty and variability. Then districts were
selected with means and standard deviations considered to be typical
of six categories of difficulty and variability. These six
categories (tests of low, moderate, and high difficulty, with low
and moderate variability) were chosen to represent types of test
score distributions typically encountered in mastery testing.

Fourth, the typical values obtained in the previous step were
used to determine R(p) and R(κ). For each of the 30 subtests, the
computer program described elsewhere (Huynh, 1978) was used to
obtain estimates of G(p), p, G(κ), and κ when the cutoff scores
were equivalent to 60%, 70%, and 80%. These data were used to
calculate R(p) and R(κ) in each case.

Finally, the values of R(p) and R(κ) obtained above were
averaged over the five CTBS content areas and the resulting values
were compiled in tabular form. Tables 1, 2, and 3 provide values
of R(p) and R(κ) for percent cutoff scores of 60%, 70%, and 80%,
respectively.

The data needed to enter the tables are: (1) test length
(n), (2) an idea of test difficulty (high, moderate, or low), (3)
test variability (low or moderate), and (4) percentage cutoff
score (60%, 70%, or 80%). The minimum sample size needed is simply
R/γ², that is, the value of R obtained from the tables divided by
the square of the acceptable proportion of sampling error in the
estimate.

TABLE 1

Values of R for p and κ for Six Categories of
Tests at the Percent Cutoff Score of 60%

Test Category                                   Number of Items
(diff)  (var)                  5          10          15          20          25          30

High    Low    (p)         0.219       0.075       0.050       0.031       0.023       0.018
               (κ)         5.349       1.623       0.666       0.391       0.307       0.209

High    Mod    (p)   [illegible]       0.061       0.036 [illegible] [illegible] [illegible]
               (κ)         2.589       0.908       0.327       0.280       0.209       0.139

Mod     Low    (p)         0.244       0.085       0.056       0.032       0.025       0.020
               (κ)         5.809       1.485       0.613       0.367       0.269       0.200

Mod     Mod    (p)         0.148       0.068       0.036       0.027       0.021       0.015
               (κ)         2.215       0.838       0.312       0.266       0.198       0.126

Low     Low    (p)         0.199       0.095       0.044       0.031       0.025       0.020
               (κ)         5.502       1.345       0.560       0.365       0.247       0.186

Low     Mod    (p)         0.142       0.068       0.032       0.024       0.020       0.016
               (κ)         2.371       0.770       0.298       0.249       0.176       0.128

Numerical example

Suppose a study is planned to assess the reliability of a
twenty-item test (n = 20) using the kappa index when a cutoff score
of 14 (c = 70%) is employed. The students for whom the test is
intended are known to be a homogeneous group of relatively high
ability. Thus, it might be expected that the test would be of low
difficulty (i.e., easy), with low variability. Let us say that a
fairly precise estimate of κ is desired, so γ_κ is set at .05.
Entering Table 2, in the row corresponding to low difficulty and
low variability, it is found that R(κ) for n = 20 items is .362.
The minimum sample size needed to estimate kappa with 5% allowable
error is then computed as

$$m = \frac{R(\kappa)}{\gamma_\kappa^2} = \frac{.362}{(.05)^2} = 144.8.$$

Thus, a sample of at least 145 students is necessary to achieve the
desired degree of precision. If reliability is to be determined
via the raw agreement index p, a similar procedure is followed
using R(p) and γ_p. Again, at least 60 students should be used in
the sample, even if it is found that m < 60.
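
The table lookup in this example can be expressed as a small Python sketch. The dictionary holds the R(κ) row of Table 2 for low difficulty and low variability. The names and the choice of additional γ values are illustrative and were introduced here to show how quickly the required sample shrinks as the tolerated error grows.

```python
from math import ceil

# R(kappa) from Table 2 (70% cutoff), low difficulty and low variability,
# keyed by the number of items.
R_KAPPA_LOW_LOW_70 = {5: 5.502, 10: 1.345, 15: 0.512,
                      20: 0.362, 25: 0.265, 30: 0.203}

MIN_EXAMINEES = 60  # floor recommended in the text


def sample_size_from_r(r_value, gamma, floor=MIN_EXAMINEES):
    """Minimum m = R / gamma^2, rounded up and never below the floor."""
    return max(floor, ceil(r_value / gamma ** 2))


for gamma in (0.05, 0.10, 0.25):
    print(gamma, sample_size_from_r(R_KAPPA_LOW_LOW_70[20], gamma))
```

At γ = .05 the sketch gives 145 examinees, as in the text. At γ = .10 and γ = .25 the computed values fall below sixty, so the floor applies.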

TABLE 2

Values of R for p and κ for Six Categories of
Tests at the Percent Cutoff Score of 70%

Test Category                                   Number of Items
(diff)  (var)                  5          10          15          20          25          30

High    Low    (p)         0.219       0.075       0.046       0.029       0.022       0.017
               (κ)         5.349       1.623       0.7?6       0.455       0.410       0.272

High    Mod    (p)         0.164       0.061       0.033       0.023       0.017       0.013
               (κ)         2.589       0.909       0.360       0.324       0.276       0.178

Mod     Low    (p)         0.244       0.085       0.053       0.031       0.023       0.019
               (κ)         5.809       1.485       0.646       0.396       0.322       0.242

Mod     Mod    (p)         0.148       0.068       0.035       0.026       0.019       0.014
               (κ)         2.215       0.838       0.321       0.289       0.237       0.149

Low     Low    (p)         0.199       0.095       0.050       0.031       0.024       0.019
               (κ)         5.502       1.345       0.512       0.362       0.265       0.203

Low     Mod    (p)         0.142       0.068       0.036       0.023       0.019       0.015
               (κ)   [illegible]       0.770       0.280       0.254       0.190       0.137


Some observations on the tabled values

In every case R(κ) > R(p). This fact implies that the sample
size necessary to estimate kappa will be larger than that needed to
estimate p, for any fixed degree of precision, γ. As noted
previously, practical limitations may require that larger
proportions of error be tolerated when estimating kappa than when
estimating p.

R-values for the case of low variability are larger than those
for moderate variability. If there is doubt about the expected
degree of variability, the value of R for the low variability case
would produce the more conservative estimate of m.

R decreases as the number of test items increases. The
relationship between R and n is not linear, however. Hence, linear
interpolation would not be appropriate for determining R for
non-tabled values of n. The value of R listed for the largest tabled
n less than the actual number of items should yield a conservative
estimate for m. For example, suppose the test considered in the
numerical example above actually contained 22 items. The tabled
value of R corresponding to n = 25 would produce an underestimate
of m, and the resulting proportion of error in estimating kappa
would exceed γ_κ. The R-value for n = 20 would overestimate m, and
the observed proportion of error would then be less than γ_κ.

TABLE 3

Values of R for p and κ for Six Categories of
Tests at the Percent Cutoff Score of 80%

Test Category                                   Number of Items
(diff)  (var)                  5          10          15          20          25          30

High    Low    (p)         0.132       0.063       0.032       0.021       0.018       0.013
               (κ)         7.076       2.805       1.494       1.055       0.887       0.660

High    Mod    (p)         0.098       0.045       0.024       0.013       0.015       0.011
               (κ)         3.510       1.678       0.608       0.717       0.568       0.404

Mod     Low    (p)         0.174       0.064       0.038       0.025       0.020       0.015
               (κ)         6.831       2.283       1.087       0.812       0.640       0.558

Mod     Mod    (p)         0.113       0.047       0.026       0.021       0.017       0.012
               (κ)         2.633       1.337       0.484       0.571       0.458       0.311

Low     Low    (p)         0.189       0.060       0.044       0.029       0.022       0.017
               (κ)         5.849       1.906       0.652       0.611       0.471       0.417

Low     Mod    (p)         0.122       0.046       0.029       0.023       0.018       0.014
               (κ)         2.675       1.113       0.348       0.430       0.325       0.248

The relationships between R and test difficulty or cutoff scores
are more complex. No simple trends can be observed in the tables.
In many testing situations, the cutoff score typically ranges from
60% to 80% correct. For cutoff scores falling between the values
in the tables, find R for both bracketing values and use the larger.
Again, consider the situation in the numerical example above.
Suppose the cutoff score was 13 (65% correct). From Tables 1 and
2, the values of R corresponding to c = 60% and 70% are .363 and
.362, respectively. The larger of these (corresponding to c = 60%)
should provide a reasonable value for R.
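
The two rules of thumb for values not covered by the tables can be sketched in Python as follows. The function names are illustrative. The R(κ) row is the low-difficulty, low-variability row of Table 2, and the two values passed to the cutoff rule are the ones quoted in the text for c = 60% and c = 70%.

```python
from bisect import bisect_right
from math import ceil

TABLED_N = [5, 10, 15, 20, 25, 30]
# R(kappa), Table 2 (70% cutoff), low difficulty and low variability.
R_KAPPA_70 = [5.502, 1.345, 0.512, 0.362, 0.265, 0.203]


def conservative_r_for_length(n_items, tabled_n=TABLED_N, r_values=R_KAPPA_70):
    """R at the largest tabled n that does not exceed n_items."""
    position = bisect_right(tabled_n, n_items) - 1
    if position < 0:
        raise ValueError("test is shorter than the shortest tabled length")
    return r_values[position]


def conservative_r_for_cutoff(r_lower_cutoff, r_upper_cutoff):
    """Larger of the R values at the two bracketing cutoff scores."""
    return max(r_lower_cutoff, r_upper_cutoff)


r_22_items = conservative_r_for_length(22)              # falls back to n = 20
r_65_percent = conservative_r_for_cutoff(0.363, 0.362)  # values quoted above
m_22_items = ceil(r_22_items / 0.05 ** 2)
print(r_22_items, r_65_percent, m_22_items)
```

For the 22-item test the sketch falls back to the n = 20 entry (.362) and again yields 145 examinees at γ = .05. For the 65% cutoff it keeps .363, the larger of the two bracketing values.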

5. CONCLUSIONS

In this paper, an approximation method has been presented for
determining the minimum sample size necessary to achieve a
specified degree of precision in estimating the raw agreement (p)
and kappa (κ) indices of reliability for mastery tests. The method
uses the quantity R, which can be calculated for known test score
distributions. Tables of R have been constructed for test score
distributions typically found in mastery testing, for a variety of
test lengths and cutoff scores. In addition, suggestions have been
made for obtaining reasonable estimates of R for situations not
directly covered by the tables.

Of course, precision is only one of the factors that must be 
considered in any study. Feasibility, cost, and classroom manage- 
ment considerations also play important roles. However, knowledge 
of necessary sample sizes should facilitate and simplify the 
planning of reliability studies. The tables presented here should 
be particularly useful for tests involving the basic skills, and 
perhaps other tests of similar construction. 